Abstract

We focus on the statistics of word occurrences and of the waiting times between such 
occurrences in Blogs. Due to the heterogeneity of words' frequencies, the empirical 
analysis is performed by studying classes of "frequently-equivalent" words, i.e. by 
grouping words depending on their frequencies. Two limiting cases are considered: 
the dilute limit, i.e. for those words that are used less than once a day, and the 
dense limit for frequent words. In both cases, extreme events occur more frequently 
than expected from the Poisson hypothesis. These deviations from Poisson statistics 
reveal non-trivial time correlations between events that are associated with bursts 
of activities. The distribution of waiting times is shown to behave like a stretched 
exponential and to have the same shape for different sets of words sharing a common 
frequency, thereby revealing universal features.

1 Introduction

Web logs, also known as Blogs, have become an influential medium (16; 13; 31)
that encompasses a broad variety of subjects, e.g. politics and science, and
they are participative by nature. They involve a huge number of interacting users
that belong to several layers of the population, from topic specialists to
average people. This variety suggests that Blogs could be an efficient information
source for identifying, tracking and modeling the spread of ideas and opinion
formation, for example in public debates over political questions. Indeed, the
democratic nature of Blogs allows us to examine how trends develop from the
interactions of decentralized bloggers and to follow dynamic opinion changes
over a wide and diverse sample of the population. This is in contrast with the
mainstream media, where relatively few journalists are involved. Precise knowledge
of word statistics in Blogs is consequently of interest in order to make
coherent statistical tests for automatically detecting critical events, e.g., trends or
media shocks (18; 19).



The most basic time statistics, which ignore correlations between events, can be
modeled by the Poisson distribution. This distribution concerns independent events:
the number $n$ of events arriving during some time interval $\Delta$ occurs with
probability

$$ P(n \mid a) = \frac{a^n}{n!}\, e^{-a}, \tag{1} $$

where $a$ is the arithmetic average number of events during this time interval.
Moreover, the distribution of waiting times $\tau$ between two successive Poisson
events is the negative exponential:

$$ f(\tau) = \tau_c^{-1} \exp(-\tau/\tau_c), \tag{2} $$

where $\tau_c = \Delta/a$ is the average characteristic waiting time between events. This
distribution is well-known to apply to nuclear disintegration but it has also
been used for describing the time gaps between shoppers entering a store (17),
the number of product failures (15), the number of terrorist acts (30) as well
as the number of airplane accidents as a function of time (Q). An increasing
amount of empirical evidence indicates, though, that human activity patterns
do not fit this model. It has been shown by many authors that human
processes are rather heterogeneously distributed in time, with short periods of
high activity (18; 19; 36), or bursts, separated by long periods of inactivity
(1; 10; 14; 29). This heterogeneity is characterized by a distribution
of waiting times which deviates from the exponential (2) and which, usually,
presents a so-called heavy tail.
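
As a minimal illustration of the Poisson reference case, the following Python sketch evaluates the probability of Eq. (1) and the exponential waiting-time density of Eq. (2), and draws synthetic waiting times for an uncorrelated process. The function names, the seed and the sample sizes are illustrative assumptions and are not part of the original analysis.

```python
import math
import random


def poisson_pmf(n, a):
    """Eq. (1): probability of n events when the mean number per interval is a."""
    return a ** n * math.exp(-a) / math.factorial(n)


def exponential_waiting_pdf(tau, tau_c):
    """Eq. (2): density of the waiting time tau for a characteristic time tau_c."""
    return math.exp(-tau / tau_c) / tau_c


def simulate_waiting_times(n_events, tau_c, seed=0):
    """Draw independent exponential waiting times (hypothetical helper).

    random.expovariate takes the rate, i.e. 1 / tau_c.
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / tau_c) for _ in range(n_events)]
```

For instance, the sample mean of `simulate_waiting_times(20000, 2.0)` should lie close to the characteristic time 2, and the values `poisson_pmf(n, a)` sum to one over `n`.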



In this paper, we focus on the statistics of such waiting times between word
occurrences in Blogs (and other similar periodically updated web sources) and
also on the statistics of the number of word occurrences per day. To do so,
we analyze texts published in 68022 RSS feeds during a period of 214 days
and consider two limiting cases. On the one hand, we focus on very rare "events",
namely words that occur on average less than once per day. It is shown that
the frequency of words is very heterogeneous, so that the time statistics have
to be measured in classes of "frequently-equivalent" words, i.e. words are
discriminated through their total number of occurrences during the whole time
period. This discrimination allows us to show that the distribution of waiting
times deviates from the exponential (2), i.e. it is fitted by a stretched
exponential and therefore presents an overpopulated tail. The deviation from the pure
exponential is evaluated with the quantity $\zeta$ that measures the importance of
the second moment of the time statistics. Interestingly, it is found that the
shape of the distribution as well as the value of $\zeta$ do not depend on the class
of words in which they are measured. On the other hand, we focus on events
that occur many times per day on average. In that case, scaling laws are
applied in order to smooth the empirical results. Deviations from the Poisson
statistics (1) are also found. Consequently, our results not only confirm that
the dynamics of topics in Blogs present bursts of activity (18; 19; 36) but
they also provide tools in order to measure the importance of such bursts by
comparing the empirical word statistics to a Poisson uncorrelated process.



2 Data description 



2.1 RSS format



Really Simple Syndication (RSS) is an XML application designed to deliver
brief summaries of the most recent updates of web sites (16), although it is
flexible enough to incorporate other applications such as reporting updates in 
digital libraries or search engine databases. Users with RSS reader software can 
subscribe to a range of RSS feeds based upon their interests, perhaps including 
favourite Blogs, some news sites or some special interest sites. The RSS reader 
will typically check each feed hourly and report to the user whenever new 
content is found. Each RSS feed contains a list of the most recent site updates, 
stored as separate XML items. When new content is added to the site, a 
new item will be added to the feed and the oldest one removed. Hence, when 
checking for updates, the RSS reader needs to parse feeds for items and report 
only items that are new, i.e. which were not in the feeds when they were 
previously checked. 
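
The update check described above can be sketched as a comparison between two successive scans of the same feed. This is only an illustration: the function name and the use of a (title, link) pair as the identity of an item are assumptions, and real RSS readers may rely on other identifiers.

```python
def new_items(previous_scan, current_scan):
    """Return the items of current_scan that were absent from previous_scan.

    Each item is represented by a hashable key; here a hypothetical
    (title, link) tuple stands in for whatever identifies a feed item.
    The order of the current scan is preserved.
    """
    seen = set(previous_scan)
    return [item for item in current_scan if item not in seen]


# A feed holding two items: the oldest one drops out when a new one arrives.
scan_day_1 = [("post A", "http://example.org/a"), ("post B", "http://example.org/b")]
scan_day_2 = [("post B", "http://example.org/b"), ("post C", "http://example.org/c")]
fresh = new_items(scan_day_1, scan_day_2)  # only "post C" is reported
```

Only items absent from the previous scan are reported, which is what allows each new post to be attributed to a single scanning time.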



RSS is also an attractive format for large scale data collection and analysis 
because it is typically concise and easy to parse text. In addition, its contents 
are easily time stamped so that time series can be generated. In contrast, web 
pages are typically much less concise and much harder to parse. Moreover, 
time series are difficult to generate from such web pages because they typically 
reveal at best a last modified date (that is not automatically updated by the 
author). Like RSS, Blogs are more amenable to time series generation because 
each posting is dated and old postings are not normally modified.

Fig. 1. Total number of posts containing a specific word as a function of its rank,
in log-log scale (a) and log-normal scale (b). The ten most used words are, in
decreasing order, {the, a, to, of, and, in, for, is, on, it}. The curves show deviations
from a power-law, but are quite well fitted (dashed line) by the modified power-law
$\sim 1/(1 + 0.2\,x^{0.65} + 0.0004\,x^{1.5})$.



2.2 Methodology 



In the following, we focus on the data collected from 68022 RSS feeds from
February 11th 2005 to October 2nd 2005. The list of feeds was obtained
predominantly from Google, using its filetype:rss command, in conjunction with
random mid-frequency words. The purpose of this method was to gain a wide
range of types of sites supporting RSS feeds. A small proportion of the feeds,
about 1%, were extracted from manual browsing of the web and the now
defunct completeRSS.com web site. Altogether, the feeds are predominantly
composed of Blogs, but also of other sources of online information, and
they are mainly in English (estimated around 70-80%). At this point, it is
also important to stress that the boundary between major news outlets and
prominent Blogs has become blurred because the top bloggers have similar
readerships to major newspapers. This difficulty therefore justifies our study
of a heterogeneous collection that encompasses several kinds of data sources,
i.e. incorporating personal diary-like Blogs, professional specialist Blogs
and newspaper RSS feeds alike. Let us also stress that one drastic event took place
during the period under consideration, the London Attacks of 7 July 2005.

Recall that each text published by a blogger is called a post and is made
of a sequence of words separated by punctuation, a blank space or markup
(e.g., HTML or XML). The data collection was performed as follows. Every
24 hours, all feeds were scanned and their content compared with the content
observed in the last scan. All new posts were attributed to the new scanning
time. Over the time period of 234 days, we observed 2294672 different words in
the data set. In Fig. 1, we plot the number of posts containing a specific word
as a function of its rank (the most frequently-occurring word has rank 1,
the second placed word has rank 2, etc.). Let us remark the deviations from the
power-law, i.e. the Zipf law $1/x^{\gamma}$, and from the Zipf-Mandelbrot law
$1/(1 + ax)^{\gamma}$ (26), as those observed in (12; 33; 28). In contrast, the
empirical curve of the presently examined data is very well fitted by a modified
power-law of the form

$$ \frac{1}{1 + a_1 x^{\gamma_1} + a_2 x^{\gamma_2}}, \tag{3} $$

where $a_1 = 0.2$, $a_2 = 0.0004$, $\gamma_1 = 0.65$ and $\gamma_2 = 1.5$. Let us also stress that
Eq. (3) includes two different characteristic exponents (p; 25) and that it is
reminiscent of Tsallis-like distributions (32). The main point for the rest of this
paper is that the rank function of Fig. 1 behaves qualitatively like a power-law,
which implies that the distribution of the number of posts also behaves like
a power-law (1). Consequently, this distribution is very wide and not peaked
around its average value, i.e. the number of posts fluctuates enormously from
one word to another.
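
To make the shape of the modified power-law of Eq. (3) more concrete, the following sketch evaluates it with the parameter values quoted in the text and estimates its local slope in log-log scale. The function names and the finite-difference step are illustrative assumptions, and the curve is only defined here up to a multiplicative constant.

```python
import math

# Parameter values quoted in the text for Eq. (3).
A1, A2 = 0.2, 0.0004
GAMMA1, GAMMA2 = 0.65, 1.5


def modified_power_law(rank):
    """Right-hand side of Eq. (3), up to a multiplicative constant."""
    return 1.0 / (1.0 + A1 * rank ** GAMMA1 + A2 * rank ** GAMMA2)


def local_exponent(rank, eps=1e-4):
    """Effective exponent -d ln(y) / d ln(x), by a centred difference in log-space."""
    upper = modified_power_law(rank * math.exp(eps))
    lower = modified_power_law(rank * math.exp(-eps))
    return -(math.log(upper) - math.log(lower)) / (2.0 * eps)
```

With these values the effective exponent grows with the rank and approaches $\gamma_2 = 1.5$ only for very large ranks, which is one way to see why a single power-law does not fit the whole curve.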



Before going further, one should also note that the above automatic scanning
was perturbed a few times due to technical problems, leading to gaps in
time as seen in Fig. 2a. These missing scans therefore led to the erroneous
attribution of posts for the missing days and for the day that followed (see
Fig. 2b). In order to perform a time analysis of word frequencies, we removed
these anomalous data from the time series. After this data cleaning, there
remained a 214 day time period. This cleaning does not change the shape of
the curves of Fig. 1, but reduces the systematic error bars for the following
waiting times study.

Fig. 2. In (a), time evolution of the number of performed scans. The discontinuities
correspond to missed scans due to technical problems. In (b), we plot the measured
number of different words as a function of time. Anomalous peaks are observed after
each missed scan. These are removed from the analysis, as spurious data.
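
The cleaning step can be sketched as follows, under the assumption that a day is discarded either when its own scan was missed or when the scan of the previous day was missed (since posts then pile up on the next successful scan). The function name and the boolean representation of the scan log are hypothetical, and the authors' exact selection rule may differ.

```python
def clean_days(scan_performed):
    """Return the indices of the days kept for the time analysis.

    scan_performed[i] is True when the scan of day i succeeded.
    Assumption of this sketch: a day is dropped if its scan was missed,
    or if the scan of the day before was missed.
    """
    kept = []
    for day, performed in enumerate(scan_performed):
        previous_missed = day > 0 and not scan_performed[day - 1]
        if performed and not previous_missed:
            kept.append(day)
    return kept
```

For a log such as `[True, True, False, False, True, True]`, this rule keeps days 0, 1 and 5: the two missed days and the day following the gap are removed.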
3 Word statistics



3.1 Ensembles of equivalent words 



Let us label each word by the index $\alpha$. The number of posts in which this
word occurs on day $i$ is noted $W_{\alpha i}$. Moreover, $W_\alpha = \sum_i W_{\alpha i}$ denotes the
number of occurrences of $\alpha$ over the total time period. As discussed above,
words may exhibit a large range of frequencies (1-10^). The spread of these
frequencies may find its origin in many causes, e.g. the word "popularity" (two
synonyms may be more or less popular) or "contextuality" (words associated
with general and frequent contexts should be used more often). Such effects may
be estimated by typing words in Google and counting the number of matches.
For instance, synonyms like "clothes" and "garments" certainly have different
popularities, as their Google matches are 136 x 10^ and 17 x 10^ respectively.
Similarly, a word associated with a popular topic/context, e.g. "music", which
occurs 951 x 10^ times, is used much more often than a word associated with
a less popular topic, e.g. "tuberculosis", which occurs 21 x 10^ times.



It is well-known that heterogeneous events' frequencies may artificially
overpopulate the tail of the distribution of waiting times. In (j2), for instance, it
is shown that such an effect may even lead to a power-law distribution of
waiting times while the system evolves in fact like a time-dependent Poisson
process. This over-population originates from the fact that a distribution
whose characteristic time fluctuates like

$$ f(\tau) = \left\langle \tau_c^{-1} \exp(-\tau/\tau_c) \right\rangle_{\tau_c} = \int d\tau_c\, \tau_c^{-1} \exp(-\tau/\tau_c)\, p(\tau_c), \tag{4} $$

where $p(\tau_c)$ is the probability that the characteristic time is $\tau_c$, always
exhibits larger fluctuations around the average waiting time than the Poisson
distribution (2) does (0; Is]). In order to overcome this difficulty, we separate
out words depending on their frequencies. Define the ensemble $E_k$ of words
$\{\alpha_1, ..., \alpha_{n_k}\}$ that occur $k$ times in the whole time interval. A word that is
used only once is usually called a "hapax legomenon", while a word used twice
is a "dis legomenon", thrice, a "tris legomenon", etc. Let us also denote the
number of words belonging to the ensemble $E_k$ by $n_k$, i.e. it is the number of
words $\alpha$ for which $W_\alpha = k$. In the following analysis, we consider that all
words belonging to the same ensemble $E_k$ are a priori equivalent. This
assumption seems reasonable a priori, as words in the same ensemble have the
same average waiting time and should be more homogeneous than words
randomly chosen in the whole set of used words. The validity of our assumption
will be verified a posteriori by showing that waiting times are distributed in
the same way in each ensemble $E_k$.

Fig. 4. Distribution of waiting times for 4 ensembles of words $E_k$,
$k = [25, 45, 85, 105]$, i.e. belonging to the dilute limit, as a function of time
(in days): (a) in a log-normal scale and (b) in a log-log scale. The solid line is a
power-law.
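
The construction of the ensembles can be sketched in a few lines: words are grouped by their total number of occurrences over the whole period. The input format (a dictionary mapping each word to its list of daily counts) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict


def build_ensembles(daily_counts):
    """Group words by their total number of occurrences k.

    daily_counts maps a word to the list of its daily counts
    (the number of posts containing the word on each day).
    Returns a dict k -> sorted list of the words whose total count is k.
    """
    ensembles = defaultdict(list)
    for word, counts in daily_counts.items():
        ensembles[sum(counts)].append(word)
    return {k: sorted(words) for k, words in ensembles.items()}


toy_counts = {
    "alpha": [1, 0, 0, 1],
    "beta": [0, 2, 0, 0],
    "gamma": [0, 0, 1, 0],
}
toy_ensembles = build_ensembles(toy_counts)  # {2: ['alpha', 'beta'], 1: ['gamma']}
```

In this toy example, the number of words in the ensemble with two occurrences is simply `len(toy_ensembles[2])`.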



3.2 Dilute limit 

Very rare words, i.e. words that occur much less than once a day on average,
are ideal in order to test Poisson statistics, as the exponential distribution for
waiting times (2) should fit them. Consequently, we focus in this section on
the ensembles of words $E_k$ with $k < 214$, i.e. that occur on average less than
once a day. It is instructive to look at the distribution $f(\tau)$ (see Eq. (4)) obtained
without splitting words into classes, i.e. by averaging the distribution over all
words occurring $k < 214$ times. From the shape of that distribution (Fig. 3),
one might conclude that waiting times between word occurrences have a power-law
distribution. We show in the following that this interpretation is erroneous
and that the power-law shape is due to the averaging process described in the
previous section. To do so, we measure the waiting time $\tau$ between two
successive occurrences of one specific word in $E_k$, for each ensemble $E_k$ separately.
The distribution $f_k(\tau)$ is then obtained by performing the analysis for each
word in $E_k$. It is shown (Fig. 4) that the width of $f_k$ depends on the value of
$k$ (this is expected as each ensemble $k$ is characterized by a different average
frequency) and that $f_k$ produces a fat tail, i.e. anomalously large probabilities
for very large and very short time intervals. This fat tail suggests that word
dynamics are dominated by bursts of activity (18) followed by long periods of
rest in which the word does not appear. However, contrary to the distribution
of Fig. 3, the distributions $f_k$ are not well-fitted by a power-law but resemble
stretched exponentials

$$ f_k(\tau) = C \exp\left[-(a\tau)^{\mu}\right], \tag{5} $$

where $\mu$ determines the shape of the distribution, $C$ is a constant of integration
and $a$ determines the time scale, all of which could depend on $k$. However,
the exponent $\mu$ is always found to be very close to $\mu = 1/2$ for all the values
of $k$. In that case, the constant of integration is $C = a/2$.

Fig. 5. In (a), empirical Risk function for 4 ensembles of words $E_k$,
$k = [25, 45, 85, 105]$, as a function of the rescaled time $t_R$. See deviations from
the exponential. In (b), we plot the quantity $\zeta$ as a function of the ensemble $k$ in
which it is measured (see text for the definitions of $\zeta$ and $k$). The dashed lines point
to the value for a Poisson process, i.e. $\zeta = 2$, and for the stretched exponential (9),
i.e. $\zeta = 10/3$.
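
The measurement of waiting times within one ensemble can be sketched as follows. Because this sketch works with daily counts, several occurrences on the same day are given a waiting time of zero; this convention, like the function names and the input format, is an assumption and not necessarily the one used for the figures.

```python
from collections import Counter


def waiting_times(daily_counts):
    """Waiting times (in days) between successive occurrences of one word.

    daily_counts[i] is the number of posts containing the word on day i.
    Assumption: occurrences falling on the same day are separated by a
    waiting time of 0, since the scans have a daily resolution.
    """
    event_days = [day for day, count in enumerate(daily_counts) for _ in range(count)]
    return [later - earlier for earlier, later in zip(event_days, event_days[1:])]


def waiting_time_distribution(ensemble_series):
    """Normalised histogram of waiting times pooled over all words of one ensemble."""
    histogram = Counter()
    for series in ensemble_series:
        histogram.update(waiting_times(series))
    total = sum(histogram.values())
    return {tau: count / total for tau, count in sorted(histogram.items())}
```

Calling `waiting_time_distribution` once per ensemble, rather than once on all words together, is what avoids mixing words with different characteristic times.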



Let us now show that the shape of the distributions $f_k(\tau)$ is universal. To do
so, it is helpful to consider the Risk function $R_k(t)$

$$ R_k(t) = \sum_{\tau = t}^{\infty} f_k(\tau) \tag{6} $$

in order to improve the statistics. The quantity $R_k(t)$ converges to zero for
$t \to \infty$ in the same way the time distribution $f(\tau)$ does for the usual exponential
and power-law time statistics (0). By construction, words in the ensemble $E_k$
are used $k$ times in 214 days. Consequently, since the average waiting time
$\langle \tau \rangle_k$ of such words is

$$ \langle \tau \rangle_k = \sum_{\tau} \tau f_k(\tau) = \frac{214}{k}, \tag{7} $$

we change the time scale like $t \to t_R = t/(214/k)$. Empirical results for a
large range of values of $k$ as a function of $t_R$ are shown in Fig. 5a and highlight
deviations from the pure exponential, thereby confirming that correlations
between word occurrences do not fit the Poisson hypothesis. Moreover, one
observes that curves overlap for every $k$, thereby showing that the non-Poisson
distributions are universal and that words belonging to different ensembles $E_k$
share the same statistical properties. Note that this is observed over a large
range of $k \in [25, 105]$.

In order to quantify the deviations from the exponential (2), it is useful to
introduce the quantity

$$ \zeta = \frac{\langle \tau^2 \rangle}{\langle \tau \rangle^2}, \tag{8} $$

where the average is performed over the distribution of waiting times. It is
easily shown that $\zeta_{\mathrm{Poisson}} = 2$ when the process is Poisson, while it is larger
than 2 if the fluctuations around the average waiting time are larger than those
of a Poisson process. If the word occurrences were strictly periodic, this quantity
would take its smallest possible value, $\zeta = 1$. We have measured $\zeta$ for
different ensembles $E_k$, $k \in [25, 105]$. It
is shown in Fig. 5b that the empirical value is always larger than the Poisson
value 2 and that it fluctuates around $\zeta_{\mathrm{empirical}} = 3.5$. Interestingly, $\zeta_k$ does not
depend on the ensemble $E_k$ in which it is measured, which implies that the
fluctuations around the average waiting time are the same in all ensembles and
therefore confirms the universality observed in Fig. 5a. Let us stress that the
empirical value $\zeta_{\mathrm{empirical}} = 3.5$ is very close to the value of $\zeta$ obtained from the
observed distribution (5). Indeed, it is straightforward to show that
$\langle \tau \rangle = 6/a$ and $\langle \tau^2 \rangle = 120/a^2$ when

$$ f(\tau) = \frac{a}{2} \exp\left(-\sqrt{a\tau}\right), \tag{9} $$

so that $\zeta = 10/3$ in that case.
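
The quantities used in this section can be sketched as follows: the Risk function as a cumulative sum over the tail of a waiting-time histogram, the rescaled time, and the ratio of the second moment to the squared mean. The last function uses a closed form that follows from Gamma-function integrals for a density proportional to $\exp[-(a\tau)^{\mu}]$, namely $\Gamma(3/\mu)\Gamma(1/\mu)/\Gamma(2/\mu)^2$; it is given here as a consistency check and is not taken from the text. All function names are illustrative assumptions.

```python
import math


def risk_function(f_k):
    """Cumulative tail sum R_k(t) = sum of f_k(tau) over tau >= t.

    f_k is a normalised histogram {tau: probability}.
    """
    risk, tail = {}, 0.0
    for tau in sorted(f_k, reverse=True):
        tail += f_k[tau]
        risk[tau] = tail
    return dict(sorted(risk.items()))


def rescale_time(t, k, n_days=214):
    """Rescaled time t_R = t / (n_days / k)."""
    return t * k / n_days


def zeta_empirical(taus):
    """Ratio <tau^2> / <tau>^2 computed from a sample of waiting times."""
    mean = sum(taus) / len(taus)
    second_moment = sum(tau * tau for tau in taus) / len(taus)
    return second_moment / mean ** 2


def zeta_stretched(mu):
    """Same ratio for a density proportional to exp(-(a * tau) ** mu).

    The result does not depend on the time scale a.
    """
    return math.gamma(3.0 / mu) * math.gamma(1.0 / mu) / math.gamma(2.0 / mu) ** 2
```

Here `zeta_stretched(1.0)` recovers the Poisson value 2 and `zeta_stretched(0.5)` recovers 10/3, while `zeta_empirical` applied to identical waiting times returns 1.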
Fig. 6. Rescaled probability distribution of the rescaled number of occurrences
Eq. (10) (+) that measures the deviations from the average $\langle x \rangle$. The data
were obtained by averaging with the proper rescaling over all words occurring
$k \in [1000, 2000]$ times, i.e. belonging to the dense limit case. This scheme has also
been applied to Poisson random data numerically generated for the same values of
$k$ (x).



3.3 Dense limit 



For words occurring many times a day, it is rather meaningless to focus on
the time lags between their occurrences, while a statistical analysis of the
number of occurrences per day makes sense. In this limit ($k \gg 1$), however,
the number $n_k$ of words occurring $k$ times is very low (see Fig. 1), so that a
smoothing method is needed. Define $p(x, k)$ to be the probability that a word
occurs $x$ times in one day, if it occurs $k$ times over the 214 days. By definition,
the average number of daily occurrences is $\langle x \rangle = k/214$, but the width of the
distribution is also expected to vary with $k$. From our data set, we can verify
that the mean square displacement $\sigma$ scales with $k$ as expected. These
two relations suggest focusing on the rescaled variable:



and on the corresponding rescaled probability distribution. By doing so, the
data are smoothed and the characterization of the probability shape is
possible.



In order to compare with Poisson events, we have numerically generated
random ensembles $E_k$. This was done by randomly allocating $k$ events into 214
boxes. The next step consists in measuring the distribution $p(x, k)$ and
performing the above rescaling. As shown in Fig. 6, empirical data are less
peaked around the average value, i.e. extreme events happen much more often
than in the Poisson case. This over-representation leads to conclusions similar
to those made in the previous section. In other words, even in the dense limit,
bursts of activity occur.
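
The generation of the uncorrelated reference data can be sketched as below: k events are dropped at random into 214 daily boxes and the daily counts are then rescaled. The rescaling used in this sketch, which centres the counts on k/214 and divides by the square root of that mean, is an assumption chosen because it is the natural one for Poisson counts; it is not claimed to be the rescaled variable of the text. Function names, the bin width and the seed are likewise hypothetical.

```python
import random
from collections import Counter

N_DAYS = 214


def allocate_events(k, n_days=N_DAYS, rng=None):
    """Randomly allocate k events into n_days boxes and return the daily counts."""
    rng = rng or random.Random(0)
    counts = [0] * n_days
    for _ in range(k):
        counts[rng.randrange(n_days)] += 1
    return counts


def rescaled_counts(counts, k, n_days=N_DAYS):
    """ASSUMED rescaling: z = (x - <x>) / sqrt(<x>), with <x> = k / n_days."""
    mean = k / n_days
    return [(x - mean) / mean ** 0.5 for x in counts]


def surrogate_distribution(ks, bin_width=0.5, seed=0):
    """Histogram of rescaled daily counts pooled over surrogate words with totals ks."""
    rng = random.Random(seed)
    histogram = Counter()
    for k in ks:
        for z in rescaled_counts(allocate_events(k, rng=rng), k):
            histogram[round(z / bin_width) * bin_width] += 1
    total = sum(histogram.values())
    return {z: count / total for z, count in sorted(histogram.items())}
```

Applying the same rescaling and binning to the empirical daily counts of words with k between 1000 and 2000 would then allow a comparison of the two histograms, in the spirit of Fig. 6.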



4 Conclusion 



In this article, we have performed an empirical analysis of the word frequencies
arising in Blogs and RSS feeds. To do so, we have collected RSS data during a
long time period (more than 200 days between February and October 2005). These
data encompass several kinds of information sources, such as newspaper RSS feeds and
personal diary-like Blogs. Our analysis has been performed by discriminating
words depending on their number of occurrences $k$. Namely, ensembles $E_k$ of
words occurring with the same frequency are defined and all words belonging
to that ensemble are assumed to be "equivalent". This method is especially
suitable when the frequency of word occurrences is very heterogeneous during
the whole time window, as a heterogeneity of frequencies may radically alter
the statistics of word occurrences.



Two limits have been considered: a dilute limit that consists of sparsely used
words and a dense limit of words used many times a day. In the dilute limit,
we have analyzed the statistics of waiting times between two successive
occurrences of a word. It has been shown by using a proper rescaling that the
distribution is the same for many ensembles $E_k$, thereby revealing a universal
behaviour for word statistics. This universal distribution of waiting times has
also been shown to deviate from the pure exponential, i.e. it behaves like a
stretched exponential, and a statistical quantity $\zeta$ has been introduced in
order to measure these deviations. Deviations from the Poisson distribution are
also observed for the number of word occurrences per day in the dense limit.
Altogether, these deviations are associated with fat tails, i.e. a high
probability to observe extreme events, and suggest that word usage is dominated by
bursts of activity followed by long periods of rest. Such bursts, which have
also been observed in other social systems, e.g., Internet traffic (36), email
or web browsing (jssi), may be caused by a response to an external triggering
factor (e.g., US elections, publicity) (22) or arise due to active endogenous
discussions between bloggers.

Theoretical models reproducing the above empirical behaviour would be of
great interest. Possible interesting ingredients include aging mechanisms (0;
23; 37) that favour the realization of the most recent words as well as copying
mechanisms in which people would have a tendency to use the words used by
their acquaintances (20; 21; 11; 24).